Comparing the Performance of Several Popular Machine Learning Algorithms on Classifying TATA-box from putative TATA box
Author
Abstract
A TATA box is a common transcription factor binding site that occurs upstream of the transcription start site (TSS) of many genes. Identifying a TATA box accurately is important because it has been shown empirically that a TSS occurs at a fixed distance downstream of a TATA box, and that distance depends only on the species. Unfortunately, many substrings of a DNA sequence fit the profile of a TATA box; such substrings are called putative TATA boxes. Identifying the TATA box among putative TATA boxes improves the accuracy of determining a TSS and hence the detection of a promoter. In this paper we investigate the effectiveness of several popular machine learning algorithms for discriminating TATA boxes from putative TATA boxes: naïve Bayes, artificial neural network, the C4.5 decision tree, random forest, and support vector machine. To compare their effectiveness, we use prediction accuracy, true positives, and false positives as metrics. Empirically, we show that a tuned support vector machine outperforms all the other machine learning algorithms.
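To make the notion of a putative TATA box and the evaluation metrics concrete, here is a minimal sketch in Python. It is not the paper's method. The pattern `TATA[TA]A[AT][AG]` is an assumption borrowed from the yeast consensus TATA(T/A)A(A/T)(A/G), since the abstract does not state which profile was used. The function names `putative_tata_boxes` and `evaluate`, and the 1/0 labelling convention, are hypothetical.

```python
import re

# Assumed profile: the consensus TATA(T/A)A(A/T)(A/G). The paper's actual
# profile is not given in the abstract. The lookahead makes the scan report
# overlapping candidates as well.
TATA_PATTERN = re.compile(r"(?=(TATA[TA]A[AT][AG]))")


def putative_tata_boxes(sequence):
    """Return (start, substring) for every substring that fits the assumed profile."""
    return [(m.start(), m.group(1)) for m in TATA_PATTERN.finditer(sequence.upper())]


def evaluate(y_true, y_pred):
    """Compute accuracy and true/false positive counts.

    Hypothetical convention: 1 = real TATA box, 0 = putative only.
    """
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "true_positives": true_positives,
        "false_positives": false_positives,
    }


if __name__ == "__main__":
    # Every hit is only a candidate; a classifier would then decide which are real.
    print(putative_tata_boxes("GGCTATAAAAGGCCTATATAAGCC"))
    # Made-up labels and predictions, purely to exercise the metric function.
    print(evaluate([1, 0, 1, 0], [1, 1, 0, 0]))
```

Each algorithm under comparison would produce its own prediction list for the same candidates, so the same `evaluate` call can score all of them on equal terms.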
Similar resources
An inspection of the domain between putative TATA box and translation start site in 79 plant genes.
Over 75 published genomic DNA sequences from several higher plants have been collected and flanking regions of the leader sequences have been analysed. In a majority of the plants, the first AUG codon on processed mRNA acted as a translation initiation site. The consensus sequence for the context was TAAACAATGGCT (on plus strand of DNA). This differed from the earlier suggestion for eukaryotic ...
Neural-statistical Model of Tata-box Motifs in Eukaryotes
The TATA-box is one of the most important binding sites in eukaryotic Polymerase II promoters. It is also one of the most common motifs in these promoters. The TATA-box is responsible mainly for the proper localization of the transcription start site (TSS) by the biochemical mechanism of DNA transcription. It also has very regular distances from the TSS. Accurate computational recognition of th...
Minimal components of the RNA polymerase II transcription apparatus determine the consensus TATA box
In Saccharomyces cerevisiae, multiple approaches have arrived at a consensus TATA box sequence of TATA(T/A)A(A/T)(A/G). TATA-binding protein (TBP) affinity alone does not determine TATA box function. To discover how a minimal set of factors required for basal and activated transcription contributed to the sequence requirements for a functional TATA box, we performed transcription reactions usin...
An inverted TATA box directs downstream transcription of the bone sialoprotein gene.
The orientation of the TATA box is thought to direct downstream transcription of eukaryotic genes by RNA polymerase II. However, the putative TATA box in the promoter of the bone sialoprotein (BSP) gene, which codes for a tissue-specific and developmentally regulated bone matrix protein, is inverted (5'-TTTATA-3') relative to the consensus TATA box sequence (5'-TATAAA-3') and is overlapped by a...
Nearest-neighbor non-additivity versus long-range non-additivity in TATA-box structure and its implications for TBP-binding mechanism
TBP recognizes its target sites, TATA boxes, by recognizing their sequence-dependent structure and flexibility. Studying this mode of TATA-box recognition, termed 'indirect readout', is important for elucidating the binding mechanism in this system, as well as for developing methods to locate new binding sites in genomic DNA. We determined the binding stability and TBP-induced TATA-box bending ...